Rufus de Rham 
Activist Archivists 
December 12, 2012 

Technical Metadata In Popular Video Sharing Sites 


ABSTRACT 

This paper will explore the embedded technical metadata from born digital content 
from three separate sources (iPhone 4S, Motorola Droid 2, and Canon t2i) and how 
it changes as it is uploaded to three of the largest video sharing sites (YouTube, 
Vimeo, and Internet Archive) and then scraped or downloaded from all three sites. 

INTRODUCTION 

On September 17, 2011 people began gathering in Zuccotti Park in New York City’s 
financial district with the goal of peacefully occupying Wall Street in protest against 
increasing corporate influence in our political system, the lack of legal repercussions 
for bankers behind the global financial crisis, and the growing disparity of wealth in 
the country. Combining the occupation of a symbolic place as in the Tahrir Square 
occupation as well as the consensus based democracy of the 15-M protest 
movement in Spain, what started as a small movement has grown into a worldwide 
movement with many cities around the country and the world being occupied in 
solidarity. Like the Arab Spring protests much of this movement is being 
documented and disseminated through the internet as the technologically adept 
protesters stream content, upload video, photos and audio, and take to Twitter and 
other social networking sites to further their message and cause. 

Given the ephemeral nature of digital video, as well as the potential lack of access to 
source media due to police confiscation, destruction, geographical distance, or 
various other reasons, much of this video exists only on these video sharing sites. 
Archiving this content, along with providing access, in different repositories is 
critically important. The original files are, for the most part, no longer accessible, 
and tracking down the users who uploaded these items in order to obtain the 
originals is improbable as a collection method. The nature of these 
movements, and most likely most activist movements moving forward, make the 
scraping of internet video a key ingredient in collecting this material. 

This paper and project come out of the Activist Archivists working group that was 
formed in late October 2011 by NYU MIAP students, alumni, professors and working 
professionals in the field of archiving. From our mission statement: 

We are a group of media archivists who support the Occupy Wall 
Street movement and its ideals. We aim to share knowledge and 
provide assistance on archiving and preservation matters in order to 
improve the discoverability of the video that is being produced; to 
support the usability of video as evidence; to ensure that the rights 
and intentions of media creators are respected; and to ensure that the 
legacy of the movement persists through open access. While we are 
not officially affiliated with OWS, we share its commitment to 
participation, collaboration, and transparency in our work. 1 

In working with other institutions, including Tamiment Library and the Internet 
Archive, it became clear that while scraping was the discussed means of collecting 
the videos no one really knew what embedded metadata was being lost or added 
during the uploading/downloading/scraping processes. I proposed a metadata test, 
which would utilize the iPhone, the Droid, and a Canon DSLR to simulate three of the 
common devices used to film the protests. The footage would then be uploaded to 
YouTube, Vimeo and the Internet Archive as these are three of the largest and most 
easily accessible video sharing sites in the United States. The video would then be 
downloaded directly or scraped (in the case of Vimeo both) and the metadata would 
be compared to the raw file. 


METHODOLOGY 

Using each device I took 30 seconds of video. This video was uploaded to YouTube, 
Vimeo, and Internet Archive directly from the device if possible and then from the 
computer after the raw file had been taken from the device. Chrome, the default 
browser on the computer used, was used to upload the files to the three sites. 
Firefox with the plugin DownloadHelper was used to scrape the video from 
YouTube and Vimeo. As a Plus Account user of Vimeo I also allowed the download of 
the source files on the videos and downloaded those. After uploading to Internet 
Archive I waited for the files to propagate and downloaded the source file. Each file 
was put through mediainfo on full information display and exported as a text file. An 
MD5 checksum was also generated for each file. This information was then imported 
into an Excel spreadsheet. The following is the methodology used for each source. 
Machine and device profiles can be found in Appendix 1. The results can be found in 
the attached Excel spreadsheet as Appendix 2. 
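
The per-file step described above (a full mediainfo report saved as text, plus an MD5 checksum) could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for the test: the function names and the report folder are hypothetical, and it assumes the mediainfo command-line tool is installed and on the PATH.

```python
import hashlib
import subprocess
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mediainfo_command(path):
    """Build the mediainfo CLI call for a full (-f) text report."""
    return ["mediainfo", "-f", str(path)]


def profile_file(path, report_dir):
    """Save the full mediainfo report as a text file and return (name, md5).

    Assumption: the mediainfo CLI is installed. report_dir is a
    hypothetical output folder that must already exist.
    """
    path = Path(path)
    report = subprocess.run(
        mediainfo_command(path), capture_output=True, text=True, check=True
    ).stdout
    out_path = Path(report_dir) / (path.name + ".mediainfo.txt")
    out_path.write_text(report, encoding="utf-8")
    return path.name, md5_of_file(path)
```

Comparing the digest of a downloaded or scraped file with the digest of the raw file shows whether the two are bit-for-bit identical.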

iPhone 4s 

The iPhone 4s was chosen because it shoots 1080p HD video and has location 
services. It also shares the bulk of the smartphone market with phones on 
the Android platform. The iPhone 4s also has GPS capabilities, and location services were 
turned on for the experiment. The iPhone was the only device to upload to both 
YouTube and Vimeo directly from the device, as there are officially supported apps 
for both. The raw video was downloaded over USB 2.0 connection into iPhoto '09. 


1 Activist Archivist Mission Statement Draft written by Yvonne Ng in collaboration with the group. 



YouTube was scraped with the video on the highest setting, and Vimeo was scraped 
with HD turned on and again with it turned off. The following are the workflows and 
the outputs (named after the colon), which correspond to the columns in the attached spreadsheet: 

iPhone -> MacBook -> mediainfo/md5: iPhone Raw (via iPhoto) 

iPhone -> YouTube -> Firefox/DownloadHelper -> mediainfo/md5: iPhone Direct 
Upload YouTube Scrape 

iPhone -> MacBook -> YouTube -> Firefox/DownloadHelper -> mediainfo/md5: iPhone 
Comp to YouTube Scrape 

iPhone -> Vimeo -> Firefox -> mediainfo/md5: iPhone Direct Vimeo Source 

iPhone -> Vimeo -> Firefox/DownloadHelper -> mediainfo/md5: iPhone Direct 
Vimeo Scrape HD/ iPhone Direct Vimeo Scrape SD 

iPhone -> MacBook -> Vimeo -> Firefox -> mediainfo/md5: iPhone Comp to Vimeo 

iPhone -> MacBook -> Vimeo -> Firefox/DownloadHelper -> mediainfo/md5: iPhone 
Comp to Vimeo Scrape HD/ iPhone Comp to Vimeo Scrape SD (HD turned on and off) 

iPhone -> MacBook -> Internet Archive -> Firefox -> mediainfo/md5: iPhone 
Computer Internet Archive 

Droid 2 

The Droid 2 was chosen as it also allows GPS tagging of videos and direct upload to 
YouTube from the phone. It is also representative of the Android platform. At the 
time of the test there was no officially supported Vimeo app for the Android 
platform so direct upload to Vimeo from the Droid was not possible. The phone was 
then plugged into the computer via USB and using the phone as an external drive the 
raw video file was put onto the computer. YouTube was scraped at the highest 
quality setting and only one scrape was used for Vimeo as the output of the Droid 
was not HD. The following are the workflows and the outputs (named after the colon), which 
correspond to the columns in the attached spreadsheet: 

Droid 2 -> MacBook -> mediainfo/md5: Droid 2 Raw 

Droid 2 -> YouTube -> Firefox/DownloadHelper -> mediainfo/md5: Droid 2 Direct 
Upload YouTube Scrape 

Droid 2 -> MacBook -> YouTube -> Firefox/DownloadHelper -> mediainfo/md5: Droid 2 
Comp to Youtube Scrape 

Droid 2 -> MacBook -> Vimeo -> Firefox -> mediainfo/md5: Droid 2 Vimeo 
Download 

Droid 2 -> MacBook -> Vimeo -> Firefox/DownloadHelper -> mediainfo/md5: Droid 2 
Vimeo Scrape 

Droid 2 -> MacBook -> Internet Archive -> Firefox -> mediainfo/md5: Droid 2 
Comp to IA 

Canon t2i 

The Canon t2i is a DSLR camera that has the capability of shooting high definition 
video. It was chosen as it is (relatively) cheap and the video sensor is the same as that of its 
bigger cousin, the Canon 7D. This makes it a popular camera for amateur and 
prosumer video creators. It does not have internet capability or GPS tagging built in 
so all uploads were done from the computer. YouTube was scraped at the highest 
quality and Vimeo was scraped twice, with HD turned on and off. The following are 
the workflows and the outputs (named after the colon), which correspond to the columns in the 
attached spreadsheet: 

t2i -> MacBook -> mediainfo/md5: t2i Raw 

t2i -> MacBook -> YouTube -> Firefox/DownloadHelper -> mediainfo/md5: t2i 
YouTube Scrape 

t2i -> MacBook -> Vimeo -> Firefox -> mediainfo/md5: t2i Vimeo Source 

t2i -> MacBook -> Vimeo -> Firefox/DownloadHelper -> mediainfo/md5: t2i Vimeo 
Scrape HD/ t2i Vimeo Scrape SD (HD turned on and off) 

t2i -> MacBook -> Internet Archive -> Firefox -> mediainfo/md5: t2i Internet 
Archive Source 


FINDINGS 

In the Activist Archivist meeting the two major issues brought up in terms of 
embedded metadata were the recording/creation date as well as location 
information. Given the potential evidentiary value of these videos, metadata such as 
date, time and location is crucial and was the initial focus of this study. 

Recording Date 


Only the iPhone video, wrapped in QuickTime .mov, had a Recorded date field. Every 
device used the Encoded date and Tagged date fields to store this information. Both 
Vimeo and Internet Archive source file downloads keep these fields and their 
values. The File last modification date and File last modification date (local) fields 
differed across the board, which makes them useless for establishing recording 
time. The scrapes from YouTube and Vimeo, however, do away with the Recorded 
date field (which perhaps indicates that it is a field from the wrapper itself) and 
alter the Encoded date and Tagged date. 

Across all tests the Encoded date and Tagged date fields were severely altered when 
scraping from the streaming services. This is because the derivative files 
created for streaming use their own creation date in these fields. None of the sites 
use this information for descriptive metadata, so unless the information is verbally 
or visually recorded in the video (through narration, taking video of street signs, etc.) 
it is impossible to determine the date of origin from the scraped video. 
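
The date comparison described above can be illustrated with a small script that reads two mediainfo text reports and lists the date fields that differ. This is only a sketch under assumptions: it presumes the reports use the usual "Field : value" line layout, and the two sample reports are invented for illustration and are not taken from the test results.

```python
DATE_FIELDS = ("Recorded date", "Encoded date", "Tagged date")

# Invented sample reports, for illustration only.
RAW_REPORT = """General
Recorded date                            : 2011-11-20T13:03:12-0500
Encoded date                             : UTC 2011-11-20 18:03:12
Tagged date                              : UTC 2011-11-20 18:03:14
"""

SCRAPE_REPORT = """General
Encoded date                             : UTC 2011-11-21 02:10:00
Tagged date                              : UTC 2011-11-21 02:10:00
"""


def parse_report(text):
    """Parse 'Field : value' lines, keeping the first value seen per field."""
    fields = {}
    for line in text.splitlines():
        name, separator, value = line.partition(" : ")
        name = name.strip()
        if separator and name not in fields:
            fields[name] = value.strip()
    return fields


def compare_dates(raw_text, derived_text):
    """Return {field: (raw value, derived value)} for date fields that differ.

    A value of None means the field is absent from that report.
    """
    raw, derived = parse_report(raw_text), parse_report(derived_text)
    changes = {}
    for field in DATE_FIELDS:
        before, after = raw.get(field), derived.get(field)
        if before != after:
            changes[field] = (before, after)
    return changes


if __name__ == "__main__":
    for field, (before, after) in compare_dates(RAW_REPORT, SCRAPE_REPORT).items():
        print(f"{field}: {before!r} -> {after!r}")
```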

GPS Information 

Both iPhone 4s and the Droid 2 allow for GPS to be tagged when creating video. The 
Droid 2 has this as an optional feature while the iPhone automatically does this if 
Location Services are turned on. The GPS data for the iPhone is found in the footer of 
the file. The QuickTime specification holds this data in a Metadata atom with the 
handler type mdta. When viewed through mediainfo this information exists in the Model field 
as well as the com.apple.quicktime.location.ISO6709 field. 2 These fields and 
information are kept when downloading the source files from Vimeo and Internet 
Archive, but are destroyed when the files are converted from their QuickTime 
wrapper to stream on YouTube and Vimeo. Strangely, the location tag on the Droid 2 
is not embedded in the file itself, and at the time of this writing the data 
could not be located even on the phone. The Droid 2 also allows users to tag 
their videos internally, but none of this information is transferred or embedded with 
the file. The only way to view this may be to root the phone, but this would break all 
ToS contracts with Verizon and void the phone's warranty, so it was not done for the test. 

YouTube seemingly pulls location data from the uploaded source files and puts it 
into a descriptive field on the YouTube page, but the field remained unchanged 
during all tests. However, the user can manually enter the GPS location to geotag the 
video within the YouTube user interface. 
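
For institutions that do receive source files, the value in the com.apple.quicktime.location.ISO6709 field can be turned into numeric coordinates. The sketch below is a minimal parser that assumes the value uses the signed decimal-degree form of ISO 6709 (latitude, longitude, optional altitude, optional trailing slash); the standard also permits degree-minute forms, which this sketch does not handle. The function name and the sample coordinates in the usage note are invented.

```python
import re

# Signed decimal degrees: +DD.DDDD, then +DDD.DDDD, then an optional altitude.
ISO6709_DECIMAL = re.compile(
    r"^(?P<lat>[+-]\d{2}(?:\.\d+)?)"
    r"(?P<lon>[+-]\d{3}(?:\.\d+)?)"
    r"(?P<alt>[+-]\d+(?:\.\d+)?)?/?$"
)


def parse_iso6709(value):
    """Return (latitude, longitude, altitude or None) from a decimal ISO 6709 string."""
    match = ISO6709_DECIMAL.match(value.strip())
    if match is None:
        raise ValueError(f"not a decimal-degree ISO 6709 string: {value!r}")
    latitude = float(match["lat"])
    longitude = float(match["lon"])
    altitude = float(match["alt"]) if match["alt"] else None
    return latitude, longitude, altitude
```

For example, `parse_iso6709("+40.7093-074.0113+010.000/")` yields a latitude of 40.7093, a longitude of -74.0113 and an altitude of 10.0.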

Other Findings 

All of the video formats studied used the AVC (Advanced Video Coding) format, also 
known as H.264/MPEG-4 Part 10 or ISO/IEC 14496-10:2003 (Information 
technology - Coding of audio-visual objects - Part 10: Advanced Video Coding). This 
technology is used in streaming video and allows higher quality video at the same 


2 QuickTime Metadata Profile: 
http://developer.apple.com/library/mac/#documentation/QuickTime/QTFF/Metadata/Metadata.html 



data rate as previous iterations. The H.264 and MPEG-4 Part 10 standards are 
maintained jointly, so they have identical technical profiles. 3 Both Vimeo 
and YouTube (as well as the devices in the test) use this codec, so even with the severe 
compression needed for streaming video (the YouTube streaming file was about 10% the 
size of the source file for the iPhone, 17% for the Droid and 10% for the t2i), the 
subjective video quality does not suffer from dropout or other noticeable video 
degradation. 

The audio codec for both the iPhone and the Droid was AAC LC, or AAC (MPEG-4) Low 
Complexity Object, which is part of ISO/IEC 14496-3:2001 (Information technology 
- Coding of audio-visual objects - Part 3: Audio) and is also used in the MPEG-4 
profile used by both YouTube and Vimeo for their streaming services. The t2i uses 
the little-endian PCM audio (sowt) codec, which is part of the Apple 
QuickTime profile. Again, the audio in the streaming files is around 
10% of the size of the source across the board. The audio on all three devices is rather poor, so 
it was difficult to do a subjective analysis of the audio, but the compression did not seem 
to affect the quality much. 

The other finding is that there was no difference between a direct upload from the phone 
and an upload from the computer. The same metadata was kept and/or discarded either way. 
As expected, the ability to download the source files from both the Vimeo Plus 
account and the Internet Archive allowed the embedded metadata to be maintained 
throughout the process, and all files passed their checksum test. Scraping from the 
streaming service only captured the severely compressed derivative files, and 
important metadata was lost. These files failed their checksum test. 

CONCLUSIONS 

The test’s main goal was to determine what the best source of video for collecting 
institutions would be, and which platforms would be recommended for content 
creators to share with the institutions. Based on this initial test and findings it is 
clear that the Internet Archive is the best way for content creators to openly share 
video on the web. The Internet Archive offers source downloads as well as robust 
licensing for Creative Commons and Public Domain footage. It is also free to use, and 
there are no download or upload caps. It automatically generates downloadable 
XML files that contain checksum information on the source file and its derivatives, as 
well as the descriptive metadata displayed on the item’s page. The Internet Archive 
also strives to maintain user anonymity and is a great digital repository. The only 
caveats are that the upload process can be buggy at times, and the user interface can 
be hard to navigate. 
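
A collecting institution could use the checksum XML mentioned above to confirm that a downloaded file is the uploaded original. The following sketch assumes a file listing in which each file element carries name and source attributes and an md5 child element. That layout reflects the author of this sketch's understanding of the Internet Archive file listing and should be checked against a real item; the sample XML, file names and function names are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed layout of an item's file listing.
SAMPLE_FILES_XML = """<files>
  <file name="clip.mov" source="original">
    <md5>5d41402abc4b2a76b9719d911017c592</md5>
  </file>
  <file name="clip.mp4" source="derivative">
    <md5>d41d8cd98f00b204e9800998ecf8427e</md5>
  </file>
</files>"""


def checksums_by_name(xml_text):
    """Return {file name: (source, md5)} for every file entry that has an md5."""
    listing = {}
    for entry in ET.fromstring(xml_text).iter("file"):
        md5 = entry.findtext("md5")
        if md5 is not None:
            listing[entry.get("name")] = (entry.get("source"), md5.strip().lower())
    return listing


def matches_original(xml_text, local_md5):
    """True if local_md5 equals the checksum of a file marked as original."""
    wanted = local_md5.strip().lower()
    return any(
        source == "original" and md5 == wanted
        for source, md5 in checksums_by_name(xml_text).values()
    )
```

A digest computed locally for the downloaded source file can be passed to `matches_original` together with the XML text; a derivative file's digest would not match.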

Vimeo is the second choice, but only if the user is a Vimeo Plus account holder. 
Vimeo Plus is $59.95 and allows users much more control over their videos and how 
they are embedded, a 5 GB a week upload limit, unlimited HD uploads and source 


3 MPEG-4 format profile: http://www.digitalpreservation.gov/formats/fdd/fdd000155.shtml 



file downloads. 4 Vimeo, like Internet Archive, allows for robust Creative Commons 
licenses. If the user is not a Vimeo Plus account holder the source files are deleted 
after a week. The other issue is that Vimeo caps basic members at five downloads 
per day per account. Therefore both the collecting institution and the user would need 
to be a Plus account holder for Vimeo to be a viable platform for exchange. Scraping 
from Vimeo gives us compressed derivative files, which have the date and location 
metadata stripped out. 

YouTube is a great platform for sharing content and getting the most hits, but it is 
absolutely the last choice for archival content. There is no way to download the 
source file, not even as a user. Unless the content creator has given the video robust 
descriptive content, it will not be easy to determine things like date and location 
information. There are only two licenses available to users, CC-BY and the standard 
YouTube license. Both Vimeo and YouTube lack an easy means to download 
descriptive metadata. The collecting institution would have to build a script to 
scrape the xml files from the page itself. 
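
As a rough idea of what such a script might involve, the sketch below pulls descriptive fields out of the meta tags of a page that has already been saved locally. It is an illustration only: which tags a given site actually exposes was not established by this test, the class and function names are hypothetical, and the sample page is invented.

```python
from html.parser import HTMLParser

# Invented sample page, for illustration only.
SAMPLE_PAGE = """<html><head>
<meta property="og:title" content="March on Broadway">
<meta name="description" content="Filmed at Zuccotti Park" />
</head><body></body></html>"""


class MetaTagCollector(HTMLParser):
    """Collect the content of <meta> tags, keyed by their property or name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            # Keep the first value seen for each key.
            self.meta.setdefault(key, content)


def extract_meta(html_text):
    """Return a dict of descriptive metadata found in the page's meta tags."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.meta
```

The resulting dictionary could be stored alongside the scraped video so that whatever description the uploader supplied travels with the file.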

YouTube is and will remain the platform of choice for video creators simply because 
it is the most used site. Institutions must keep in mind that scraping video violates 
the Terms of Service with both YouTube and Vimeo. However, this is the only real 
means of collection at this point. Through user education and institutional outreach 
we hope to encourage content creators to upload videos on as many platforms as 
possible, as this increases visibility. If they want professional repositories to keep and 
preserve their media in a way that follows best practices, the choice is simple: 
Internet Archive. However, scraping and collecting from Vimeo and YouTube is 
necessary to get the most content, although it is much more time intensive, as 
descriptive metadata must be scraped as well, and institutions need to be aware that 
the date and time metadata embedded in the scraped file does not reflect the original recording. 

Going forward this test should be expanded to other devices and other video sharing 
sites. More testing about GPS data on the Android platform is needed as well. George 
Mason University Center for History and New Media also recently launched 
OccupyArchive.org which should be tested as well, as it is one of the first sites to 
directly ask people from the Occupy movement to share files to the repository. 


4 http://vimeo.com/help/faq/vimeo_plus 



APPENDIX 1: MACHINE AND DEVICE PROFILES 


iPhone 4s 

iOS version 5.0.1 on Verizon Network 
Vimeo App 1.0.3 

OEM install of YouTube (unable to determine version number) 

Camera: 8MP camera 1080p HD video default, Location Services turned on 

Motorola Droid 2 

Android version 2.2 on Verizon Network 
OEM install of YouTube Version 2.3.4 

Camera: 5MP camera D1 (720 x 480) Resolution, GPS turned on 

Canon t2i 

Firmware version 1.0.9 

Video Settings: 1920 x 1080 24fps NTSC 

USB 2.0 connection used on all three to attach to computer. 

MacBook 7,1 

OSX 10.6.8 Intel Core 2 Duo 2.4GHz 2 (lxl) GB 1067 MHz DDR3 Ram 
Firefox 6.0.2 with DownloadHelper 4.9.7 extension to scrape files. 
MediaInfo CLI running MediaInfoLib v0.7.50 
MD5 CLI via terminal 
iPhoto '09 version 8.1.2 (424) 

Google Chrome version 15.0.874.121 to upload to YouTube, Vimeo and 
Internet Archive 



